Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave the service and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
You need to identify the best possible model that will give the required performance.
Explore and visualize the dataset.
Build a classification model to predict if the customer is going to churn or not
Optimize the model using appropriate techniques
Generate a set of insights and recommendations that will help the bank
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
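Several of these fields are arithmetically related, which matters later for collinearity checks: Avg_Open_To_Buy is the unused part of the credit limit, and the utilization ratio is the revolving balance divided by the limit. A minimal sketch with toy values (not taken from the dataset) to illustrate the relationships:

```python
import pandas as pd

# Toy rows with hypothetical values, only to illustrate the relationships
toy = pd.DataFrame(
    {
        "Credit_Limit": [10000.0, 4000.0],
        "Total_Revolving_Bal": [2500.0, 1000.0],
        "Avg_Open_To_Buy": [7500.0, 3000.0],
    }
)

# Open to buy is the part of the limit not carried as revolving balance
assert (
    toy["Credit_Limit"] == toy["Total_Revolving_Bal"] + toy["Avg_Open_To_Buy"]
).all()

# Utilization is the share of the limit the customer is using
toy["Avg_Utilization_Ratio"] = toy["Total_Revolving_Bal"] / toy["Credit_Limit"]
print(toy["Avg_Utilization_Ratio"].tolist())  # [0.25, 0.25]
```

If the real columns satisfy these identities (they should, up to rounding), Avg_Open_To_Buy carries no information beyond Credit_Limit and Total_Revolving_Bal.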
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer, KNNImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Read the data file
data = pd.read_csv("BankChurners.csv")
# inspecting shape of dataset
data.shape
# Creating a copy of data object
bcData = data.copy()
# Inspecting top 5 rows
bcData.head()
# Inspecting last 5 rows
bcData.tail()
# Inspecting column data types and missing feature values
bcData.info()
# let's check for duplicate values in the data
bcData.duplicated().sum()
There are no duplicate rows in the data.
# Check for missing values ratio for features
round(bcData.isnull().sum() / bcData.isnull().count() * 100)
# let's view the statistical summary of the numerical columns in the data
bcData.describe().T
# Dropping CLIENTNUM since it is a unique identifier with no predictive value
bcData.drop(columns=["CLIENTNUM"], inplace=True)
# Analyzing the categorical variables in data
cats = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# Printing unique values in each category
for cat in cats:
print(bcData[cat].value_counts())
print("-" * 40)
# Inspecting rows with the placeholder income value 'abc' or missing categories
bcData[
(bcData["Income_Category"] == "abc")
| (bcData["Education_Level"].isnull())
| (bcData["Marital_Status"].isnull())
]
# Merging Education_Level: Post-Graduate and Doctorate into Post-Graduate
bcData["Education_Level"] = bcData["Education_Level"].replace(
["Doctorate"], "Post-Graduate"
)
# Replacing Income_Category = 'abc' with 'Unknown'
bcData["Income_Category"] = bcData["Income_Category"].replace(["abc"], "Unknown")
# Printing unique values in each category
for cat in cats:
print(bcData[cat].value_counts())
print("-" * 40)
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Observations on Customer age
histogram_boxplot(bcData, "Customer_Age")
# Checking 10 largest values of age
bcData.Customer_Age.nlargest(10)
bcData[bcData["Customer_Age"] > 66]
bcData["Customer_Age"].clip(upper=66, inplace=True)
# Observations on Dependent_count
histogram_boxplot(bcData, "Dependent_count")
# Observations on Months_on_book
histogram_boxplot(bcData, "Months_on_book")
# Observations on Total_Relationship_Count
histogram_boxplot(bcData, "Total_Relationship_Count")
# Observations on Months_Inactive_12_mon
histogram_boxplot(bcData, "Months_Inactive_12_mon")
# Observations on Contacts_Count_12_mon
histogram_boxplot(bcData, "Contacts_Count_12_mon")
# Observations on Credit_Limit
histogram_boxplot(bcData, "Credit_Limit")
# Observations on Total_Revolving_Bal
histogram_boxplot(bcData, "Total_Revolving_Bal")
# Observations on Avg_Open_To_Buy
histogram_boxplot(bcData, "Avg_Open_To_Buy")
# Observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(bcData, "Total_Amt_Chng_Q4_Q1")
# Checking values greater than 2.25
bcData[bcData["Total_Amt_Chng_Q4_Q1"] > 2.25]
bcData.Total_Amt_Chng_Q4_Q1.clip(upper=2.25, inplace=True)
# Observations on Total_Trans_Amt
histogram_boxplot(bcData, "Total_Trans_Amt")
# Observations on Total_Trans_Ct
histogram_boxplot(bcData, "Total_Trans_Ct")
# Checking values greater than 134
bcData[bcData["Total_Trans_Ct"] > 134]
# Capping values at 134
bcData["Total_Trans_Ct"].clip(upper=134, inplace=True)
# Observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(bcData, "Total_Ct_Chng_Q4_Q1")
# Checking values greater than 2.5
bcData[bcData["Total_Ct_Chng_Q4_Q1"] > 2.5]
bcData.Total_Ct_Chng_Q4_Q1.clip(upper=2.5, inplace=True)
# Observations on Avg_Utilization_Ratio
histogram_boxplot(bcData, "Avg_Utilization_Ratio")
# function to create labeled barplots
def labeled_barplot(data, feature, n=None):
"""
Barplot with percentage and count at the top
data: dataframe
feature: dataframe column
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
perc = 0
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
perc = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
        label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
        y = p.get_height() + 40  # y-coordinate: just above the bar
ax.annotate(
perc + " / " + str(label),
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Observations on Attrition_Flag
labeled_barplot(bcData, "Attrition_Flag", 7)
# Observations on Gender
labeled_barplot(bcData, "Gender", 7)
# Observations on Education_Level
labeled_barplot(bcData, "Education_Level", 7)
# Observations on Marital_Status
labeled_barplot(bcData, "Marital_Status", 7)
# Observations on Income_Category
labeled_barplot(bcData, "Income_Category", 7)
# Observations on Card_Category
labeled_barplot(bcData, "Card_Category", 7)
sns.pairplot(bcData, hue="Attrition_Flag")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Total_Revolving_Bal", x="Gender", data=bcData, orient="vertical")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Credit_Limit", x="Gender", data=bcData, orient="vertical")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Total_Trans_Ct", x="Gender", data=bcData, orient="vertical")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Total_Trans_Amt", x="Gender", data=bcData, orient="vertical")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(
y="Total_Revolving_Bal", x="Income_Category", data=bcData, orient="vertical"
)
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Total_Trans_Amt", x="Income_Category", data=bcData, orient="vertical")
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(
y="Avg_Utilization_Ratio", x="Card_Category", data=bcData, orient="vertical"
)
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(
y="Avg_Utilization_Ratio", x="Income_Category", data=bcData, orient="vertical"
)
cols = bcData[
[
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
]
].columns.tolist()
plt.figure(figsize=(15, 15))
for i, variable in enumerate(cols):
plt.subplot(3, 3, i + 1)
    sns.boxplot(x=bcData["Attrition_Flag"], y=bcData[variable])
plt.tight_layout()
plt.title(variable)
plt.show()
cols = bcData[
[
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Total_Amt_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
].columns.tolist()
plt.figure(figsize=(15, 15))
for i, variable in enumerate(cols):
plt.subplot(3, 3, i + 1)
    sns.boxplot(x=bcData["Attrition_Flag"], y=bcData[variable])
plt.tight_layout()
plt.title(variable)
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 115)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
stacked_barplot(bcData, "Gender", "Attrition_Flag")
stacked_barplot(bcData, "Education_Level", "Attrition_Flag")
stacked_barplot(bcData, "Marital_Status", "Attrition_Flag")
stacked_barplot(bcData, "Income_Category", "Attrition_Flag")
stacked_barplot(bcData, "Card_Category", "Attrition_Flag")
plt.figure(figsize=(15, 7))
sns.heatmap(bcData.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# Dropping Avg_Open_To_Buy as it is highly correlated with Credit_Limit
bcData.drop(columns=["Avg_Open_To_Buy"], inplace=True)
# Separating target variable and other variables
X = bcData.drop(columns="Attrition_Flag")
X = pd.get_dummies(X)
y = bcData["Attrition_Flag"].apply(lambda x: 0 if x == "Existing Customer" else 1)
# Splitting data in Temp and Test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# Splitting data from temp set to training and validation set
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
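As a quick sanity check on the split fractions (toy arithmetic with a hypothetical sample count, independent of the actual dataset): holding out 20% for test and then 25% of the remaining temp set for validation yields a 60/20/20 train/validation/test split.

```python
# Hypothetical sample count, just to verify the split arithmetic
n = 1000
n_test = int(n * 0.2)  # 20% held out for test -> 200
n_temp = n - n_test  # 800 rows remain in the temp set
n_val = int(n_temp * 0.25)  # 25% of temp for validation -> 200
n_train = n_temp - n_val  # 600 rows remain for training
print(n_train, n_val, n_test)  # 600 200 200
```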
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
We'll build different models using StratifiedKFold and cross_val_score, and tune the best models using GridSearchCV and RandomizedSearchCV.
models = []  # Empty list to hold candidate models
# Adding models into the list
models.append(("lg", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = []  # Empty list to hold CV results
names = []  # Empty list to hold model names
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state = 1)
cv_result = cross_val_score(estimator= model, X=X_train, y=y_train, scoring=scoring, cv=kfold)
results.append(cv_result)
names.append(name)
print("{0} : {1}".format(name, cv_result.mean()*100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{0} : {1}".format(name,scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We will tune the Gradient Boosting, XGBoost, and AdaBoost models using RandomizedSearchCV and compare the performance of the tuned models.
We will use the following two functions from the case study to calculate metrics and plot the confusion matrix.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# defining model.
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_jobs=-1,
n_iter=50,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
# building model with best parameters
gbm_tuned1 = GradientBoostingClassifier(
n_estimators=100, subsample=0.8, max_features=0.7, random_state=1
)
# Fit the model on training data
gbm_tuned1.fit(X_train, y_train)
# Calculating different metrics on train set
gbm_random_train = model_performance_classification_sklearn(
gbm_tuned1, X_train, y_train
)
print("Training performance:")
gbm_random_train
# Calculating different metrics on validation set
gbm_random_val = model_performance_classification_sklearn(gbm_tuned1, X_val, y_val)
print("Validation performance:")
gbm_random_val
# creating confusion matrix
confusion_matrix_sklearn(gbm_tuned1, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_adb_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_adb_cv.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_adb_cv.best_params_, randomized_adb_cv.best_score_
    )
)
# building model with best parameters
adb_tuned1 = AdaBoostClassifier(
n_estimators=90,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned1.fit(X_train, y_train)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned1, X_train, y_train
)
print("Training performance:")
Adaboost_random_train
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(adb_tuned1, X_val, y_val)
print("Validation performance:")
Adaboost_random_val
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned1, X_val, y_val)
%%time
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
xgb_tuned1 = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
xgb_tuned1.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        xgb_tuned1.best_params_, xgb_tuned1.best_score_
    )
)
# building model with best parameters
xgb_tuned1 = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
gamma=0,
subsample=0.8,
learning_rate=0.1,
eval_metric="logloss",
max_depth=2,
reg_lambda=10,
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
xgb_tuned1, X_train, y_train
)
print("Training performance:")
xgboost_random_train
# Calculating different metrics on validation set
xgboost_random_val = model_performance_classification_sklearn(xgb_tuned1, X_val, y_val)
print("Validation performance:")
xgboost_random_val
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1, X_val, y_val)
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
results = []  # Initialize empty list for CV results
names = []  # Initialize empty list for model names
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{0} : {1}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print("{0} : {1}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# defining model.
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_jobs=-1,
n_iter=50,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
# building model with best parameters
gbm_tuned1_over = GradientBoostingClassifier(
n_estimators=30, subsample=1, max_features=0.8, random_state=1
)
# Fit the model on training data
gbm_tuned1_over.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
gbm_random_train_over = model_performance_classification_sklearn(
gbm_tuned1_over, X_train_over, y_train_over
)
print("Training performance:")
gbm_random_train_over
# Calculating different metrics on validation set
gbm_random_val_over = model_performance_classification_sklearn(
gbm_tuned1_over, X_val, y_val
)
print("Validation performance:")
gbm_random_val_over
# creating confusion matrix
confusion_matrix_sklearn(gbm_tuned1_over, X_val, y_val)
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
# building model with best parameters
ada_tuned1_over = AdaBoostClassifier(
n_estimators=100,
learning_rate=0.01,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
ada_tuned1_over.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
ada_random_train_over = model_performance_classification_sklearn(
ada_tuned1_over, X_train_over, y_train_over
)
print("Training performance:")
ada_random_train_over
# Calculating different metrics on validation set
ada_random_val_over = model_performance_classification_sklearn(
ada_tuned1_over, X_val, y_val
)
print("Validation performance:")
ada_random_val_over
# creating confusion matrix
confusion_matrix_sklearn(ada_tuned1_over, X_val, y_val)
%%time
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
xgb_tuned1 = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
xgb_tuned1.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        xgb_tuned1.best_params_, xgb_tuned1.best_score_
    )
)
# building model with best parameters
xgb_tuned1_over = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
gamma=1,
subsample=0.8,
learning_rate=0.05,
eval_metric="logloss",
max_depth=2,
reg_lambda=10,
)
# Fit the model on training data
xgb_tuned1_over.fit(X_train_over, y_train_over)
# Calculating different metrics on train set
xgb_random_train_over = model_performance_classification_sklearn(
xgb_tuned1_over, X_train_over, y_train_over
)
print("Training performance:")
xgb_random_train_over
# Calculating different metrics on validation set
xgb_random_val_over = model_performance_classification_sklearn(
xgb_tuned1_over, X_val, y_val
)
print("Validation performance:")
xgb_random_val_over
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1_over, X_val, y_val)
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
results = []  # Initialize empty list for CV results
names = []  # Initialize empty list for model names
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
    print("{0} : {1}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val))
print("{0} : {1}".format(name, scores))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# defining model.
model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"subsample": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=model,
param_distributions=param_grid,
n_jobs=-1,
n_iter=50,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
# building model with best parameters
gbm_tuned1_un = GradientBoostingClassifier(
n_estimators=90, subsample=1, max_features=0.8, random_state=1
)
# Fit the model on training data
gbm_tuned1_un.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
gbm_random_train_un = model_performance_classification_sklearn(
gbm_tuned1_un, X_train_un, y_train_un
)
print("Training performance:")
gbm_random_train_un
# Calculating different metrics on validation set
gbm_random_val_un = model_performance_classification_sklearn(
gbm_tuned1_un, X_val, y_val
)
print("Validation performance:")
gbm_random_val_un
# creating confusion matrix
confusion_matrix_sklearn(gbm_tuned1_un, X_val, y_val)
- The recall score of the tuned Gradient Boosting model on the training set is close to the validation score.
- However, there is a large gap between the training and validation precision scores.
- Overall, the model generalizes reasonably well.
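The recall/precision gap noted above is typical of models trained on undersampled data: recall can stay stable while extra false positives on the original, imbalanced class distribution pull precision down. A small sketch with hypothetical predictions (not this model's actual outputs) illustrating the pattern:

```python
from sklearn.metrics import precision_score, recall_score

# 10 positives, 90 negatives, mimicking an imbalanced validation set
y_true = [1] * 10 + [0] * 90

# Both hypothetical predictors find 9 of the 10 positives (recall = 0.9),
# but the second one raises many more false alarms on the negatives
y_pred_few_fp = [1] * 9 + [0] + [1] * 5 + [0] * 85  # 9 TP, 5 FP
y_pred_many_fp = [1] * 9 + [0] + [1] * 30 + [0] * 60  # 9 TP, 30 FP

for name, y_pred in [("few FP", y_pred_few_fp), ("many FP", y_pred_many_fp)]:
    print(
        name,
        "recall:", recall_score(y_true, y_pred),
        "precision:", round(precision_score(y_true, y_pred), 3),
    )
# few FP  -> recall 0.9, precision 0.643
# many FP -> recall 0.9, precision 0.231
```

This is why we track precision alongside recall when comparing the resampled models.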
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
# building model with best parameters
ada_tuned1_un = AdaBoostClassifier(
n_estimators=90,
learning_rate=0.1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
ada_tuned1_un.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
ada_random_train_un = model_performance_classification_sklearn(
ada_tuned1_un, X_train_un, y_train_un
)
print("Training performance:")
ada_random_train_un
# Calculating different metrics on validation set
ada_random_val_un = model_performance_classification_sklearn(
ada_tuned1_un, X_val, y_val
)
print("Validation performance:")
ada_random_val_un
# creating confusion matrix
confusion_matrix_sklearn(ada_tuned1_un, X_val, y_val)
%%time
# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
xgb_tuned1 = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)
# Fitting parameters in RandomizedSearchCV
xgb_tuned1.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        xgb_tuned1.best_params_, xgb_tuned1.best_score_
    )
)
# building model with best parameters
xgb_tuned1_un = XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
gamma=1,
subsample=0.9,
learning_rate=0.01,
eval_metric="logloss",
max_depth=1,
reg_lambda=5,
)
# Fit the model on training data
xgb_tuned1_un.fit(X_train_un, y_train_un)
# Calculating different metrics on train set
xgb_random_train_un = model_performance_classification_sklearn(
xgb_tuned1_un, X_train_un, y_train_un
)
print("Training performance:")
xgb_random_train_un
# Calculating different metrics on validation set
xgb_random_val_un = model_performance_classification_sklearn(
xgb_tuned1_un, X_val, y_val
)
print("Validation performance:")
xgb_random_val_un
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1_un, X_val, y_val)
# training performance comparison
models_train_comp_df = pd.concat(
[
gbm_random_train.T,
gbm_random_train_over.T,
gbm_random_train_un.T,
Adaboost_random_train.T,
ada_random_train_over.T,
ada_random_train_un.T,
xgboost_random_train.T,
xgb_random_train_over.T,
xgb_random_train_un.T,
],
axis=1,
)
models_train_comp_df.columns = [
"GradientBoost Tuned Random search",
"GradientBoost Oversampled Random search",
"GradientBoost Undersampled Random search",
"AdaBoost Tuned Random search",
"AdaBoost Tuned Oversampled Random search",
"AdaBoost Tuned Undersampled Random search",
"Xgboost Tuned Random Search",
"Xgboost Tuned Oversampled Random Search",
"Xgboost Tuned Undersampled Random Search",
]
print("Training performance comparison:")
models_train_comp_df
# Validation performance comparison
models_val_comp_df = pd.concat(
[
gbm_random_val.T,
gbm_random_val_over.T,
gbm_random_val_un.T,
Adaboost_random_val.T,
ada_random_val_over.T,
ada_random_val_un.T,
xgboost_random_val.T,
xgb_random_val_over.T,
xgb_random_val_un.T,
],
axis=1,
)
models_val_comp_df.columns = [
"GradientBoost Tuned Random search",
"GradientBoost Oversampled Random search",
"GradientBoost Undersampled Random search",
"AdaBoost Tuned Random search",
"AdaBoost Tuned Oversampled Random search",
"AdaBoost Tuned Undersampled Random search",
"Xgboost Tuned Random Search",
"Xgboost Tuned Oversampled Random Search",
"Xgboost Tuned Undersampled Random Search",
]
print("Validation performance comparison:")
models_val_comp_df
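Rather than picking the champion model by eye, the comparison table can be queried directly. A sketch on a stand-in table, assuming the real `models_val_comp_df` has metric names (including `Recall`) as its index and model names as its columns:

```python
# Sketch: select the model with the highest validation recall from a
# metrics-by-model comparison table (toy values stand in for the real table).
import pandas as pd

toy_val_comp = pd.DataFrame(
    {
        "AdaBoost Tuned Undersampled": [0.96, 0.98, 0.85, 0.91],
        "Xgboost Tuned": [0.95, 0.93, 0.88, 0.90],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# idxmax over the Recall row returns the column (model) name with the best score
best_model = toy_val_comp.loc["Recall"].idxmax()
print("Best model by validation recall:", best_model)
```

Since recall on attrited customers is the metric being optimized throughout, ranking on the `Recall` row keeps the model choice consistent with the tuning objective.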
- Based on the validation comparison, the AdaBoost Tuned Undersampled Random search and Xgboost Tuned Random Search models give the best performance.
# Calculating different metrics on the test set
xgboost_grid_test = model_performance_classification_sklearn(xgb_tuned1, X_test, y_test)
print("Test performance:")
xgboost_grid_test
feature_names = X.columns
importances = ada_tuned1_un.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
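The bar chart above can also be read as a ranked table. A minimal sketch (on a toy AdaBoost fitted to synthetic data, since `ada_tuned1_un` is fitted in the cells above): sort `feature_importances_` into a labeled, descending Series.

```python
# Sketch: turn a fitted model's feature_importances_ into a ranked table.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# toy stand-in for the tuned AdaBoost model and its feature matrix
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=1)
toy_model = AdaBoostClassifier(n_estimators=20, random_state=1).fit(X_toy, y_toy)

importance_series = pd.Series(
    toy_model.feature_importances_, index=[f"feat_{i}" for i in range(5)]
).sort_values(ascending=False)
print(importance_series)
```

With the real model, `importance_series.head()` gives the top drivers of attrition in a form that is easy to paste into the insights section.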
# creating a list of numerical variables
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Total_Amt_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a list of categorical variables
categorical_features = [
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
]
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
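To see what the categorical transformer above actually does, here is a minimal sketch on a toy frame (the column name is illustrative): the missing value is imputed with the mode, then the column is one-hot encoded.

```python
# Sketch: impute-then-one-hot-encode on a single categorical column.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_cat = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

toy = pd.DataFrame({"Gender": ["M", "F", np.nan, "M"]})
encoded = demo_cat.fit_transform(toy)
# NaN is replaced by the mode ("M"), then encoded into one column per category
print(encoded.toarray())
```

`handle_unknown="ignore"` matters at scoring time: a category unseen during training is encoded as all zeros instead of raising an error.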
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# Separating target variable and other variables
X = bcData.drop(columns="Attrition_Flag")
Y = bcData["Attrition_Flag"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
("pre", preprocessor),
(
"AdaBoost",
AdaBoostClassifier(
n_estimators=90,
learning_rate=0.1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
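Because the fitted `Pipeline` bundles preprocessing and the classifier into one object, it can be persisted and reloaded as a unit and then score raw (unimputed, unencoded) data directly. A sketch on a toy pipeline; the real `model` above works the same way.

```python
# Sketch: persist a fitted sklearn Pipeline with joblib and verify the
# reloaded copy predicts identically.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, random_state=1)
toy_pipe = Pipeline(
    steps=[("scale", StandardScaler()), ("clf", AdaBoostClassifier(random_state=1))]
).fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), "churn_pipeline.joblib")
joblib.dump(toy_pipe, path)
reloaded = joblib.load(path)

same = (reloaded.predict(X_toy) == toy_pipe.predict(X_toy)).all()
print("Reloaded pipeline matches:", same)
```

Shipping the pipeline as a single artifact avoids train/serve skew: the same imputation and encoding fitted on the training data are applied to every new customer record.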